Abstract


The robust retrieval track explores methods for improving the consistency of retrieval technology by focusing
on poorly performing topics. The retrieval task in the track is a traditional ad hoc retrieval task where the
evaluation methodology emphasizes a system’s least effective topics. The most promising approach to improving poorly
performing topics is exploiting text collections other than the target collection, such as the web.

The 2004 edition of the track used 250 topics and required systems to rank the topics by predicted difficulty. The 
250 topics within the test set allowed the stability of evaluation measures that emphasize poorly performing topics 
to be investigated. A new measure, a variant of the traditional MAP measure that uses a geometric mean rather 
than an arithmetic mean to average individual topic results, shows promise of giving appropriate emphasis to poorly 
performing topics while being more stable at equal topic set sizes. 


The ability to return at least passable results for any topic is an important feature of an operational retrieval system. 
While system effectiveness is generally reported as average effectiveness, an individual user does not see the average 
performance of the system, but only the effectiveness of the system on his or her requests. A user whose request 
retrieves nothing of interest is unlikely to be consoled by the fact that the system responds better to other people’s 
requests. 

The TREC robust retrieval track was started in TREC 2003 to investigate methods for improving the consistency 
of retrieval technology. The first year of the track had two main technical results: 


1. The track provided ample evidence that optimizing average effectiveness using the standard Cranfield methodology
and standard evaluation measures further improves the effectiveness of the already-effective topics, sometimes
at the expense of the poor performers.


2. The track results demonstrated that measuring poor performance is intrinsically difficult because there is so 
little signal in the sea of noise for a poorly performing topic. Two new measures devised to emphasize poor 
performers did so, but because there is so little information the measures are unstable. Having confidence in the 
conclusion that one system is better than another using these measures requires larger differences in scores than 
are generally observed in practice when using 50 topics. 


The retrieval task in the track is a traditional ad hoc task. In addition to calculating scores using trec_eval, each
run is also evaluated using the two measures introduced in the TREC 2003 track that focus more specifically on the 
least-well-performing topics. The TREC 2004 track differed from the initial track in two important ways. First, the 
test set of topics consisted of 249 topics, up from 100 topics. Second, systems were required to rank the topics by 
predicted difficulty, with the goal of eventually being able to use such predictions to do topic-specific processing. 

This paper presents an overview of the results of the track. The first section describes the data used in the track, 
and the following section gives the retrieval results. Section 3 investigates how accurately systems can predict which 
topics are difficult. Since one of the main results of the TREC 2003 edition of the track was that poor performance
is hard to measure with 50 topics, Section 4 examines the stability of the evaluation measures for larger topic set sizes.
The final section looks at the future of the track. 


1 The Robust Retrieval Task 


As mentioned, the task within the robust retrieval track is a traditional ad hoc task. Since the TREC 2003 track had 
shown that 50 topics was not sufficient for a stable evaluation of poorly performing topics, the TREC 2004 track used 


Table 1: Relevant document statistics for topic sets. 


Topic Set   Number of   Mean Relevant   Minimum #   Maximum #
            topics      per Topic       Relevant    Relevant

Old         200
New          49
Hard         50
Combined    249





a set of 250 topics (one of which was subsequently dropped due to having no relevant documents). The topic set 
consisted of 200 topics that had been used in some prior TREC plus 50 topics created for this year’s track. The 200 
old topics were the combined set of topics used in the ad hoc task in TRECs 6-8 (topics 301-450) plus the topics
developed for the TREC 2003 robust track (topics 601-650). The 50 new topics created for this year’s track are 
topics 651-700. The document collection was the set of documents on TREC disks 4 and 5, minus the Congressional 
Record, since that was the document set used with the old topics in the previous TREC tasks. This document set 
contains approximately 528,000 documents and 1,904 MB of text. 

In the TREC 2003 robust track, 50 of the topics from the 301-450 set were distinguished as being particularly 
difficult for retrieval systems. These topics each had low median average precision scores but at least one high outlier 
score in the initial TREC in which they were used. Effectiveness scores over this topic set remained low in the 2003 
robust track. This topic set is designated as the “hard” set in the discussion below. 

While using old topics allows the test set to contain many topics with at least some of the topics known to be 
difficult, it also means that full relevance data for these topics is available to the participants. Since we could not 
control how the old topics had been used in the past, the assumption was that the old topics were fully exploited in 
any way desired in the construction of a participants’ retrieval system. In other words, participants were allowed to 
explicitly train on the old topics if they desired to. The only restriction placed on the use of relevance data for the old 
topics was that the relevance judgments could not be used during the processing of the submitted runs. This precluded 
such things as true (rather than pseudo) relevance feedback and computing weights based on the known relevant set. 

The existing relevance judgments were used for the old topics; no new judgments of any kind were made for these 
topics. The new topics were judged by creating pools from three runs per group and using the top 100 documents per 
run. There was an average of 704 documents judged for each new topic. The assessors made three-way judgments 
of not relevant, relevant, or highly relevant for the new topics. As noted above, topic 672 had no documents judged 
relevant for it, so it was dropped from the evaluation. An additional 10 topics had no documents judged highly 
relevant. All the evaluation results reported for the track consider both relevant and highly relevant documents as the 
relevant set. Table 1 gives the total number of topics, the average number of relevant documents, and the minimum 
and maximum number of relevant documents for a topic for the four topic sets used in the track. 

While no new judgments were made for the old topics, NIST did form pools for those topics to examine the 
coverage of the original judgment set. Across the set of 200 old topics, an average of 70.8% (minimum 36.6%, 
maximum 93.7%) of the documents in the pools created using robust track runs were judged. Across the 110 runs 
that were submitted to the track, there was an average of 0.3 (min 0.0, max 2.9) unjudged documents in the top 10 
documents retrieved, and 11.2 (min 2.9, max 37.5) unjudged documents in the top 100 retrieved. The runs with the 
largest number of unjudged documents were also the runs that performed the least well. This makes sense in that the
irrelevant documents retrieved by these runs are unlikely to be in the original judgment set. While it is possible
that the runs were scored as being ineffective because they had large numbers of unjudged documents, this is unlikely 
to be the case since the same runs were ineffective when evaluated over just the new set of topics. 

Runs were evaluated using trec_eval, with average scores computed over the set of 200 old topics, the set of 49
new topics, the set of 50 hard topics, and the combined set of 249 topics. Two additional measures that were introduced 
in the TREC 2003 track were computed over the same four topic sets [11]. The %no measure is the percentage of 
topics that retrieved no relevant documents in the top ten retrieved. The area measure is the area under the curve 
produced by plotting MAP(X) vs. X when X ranges over the worst quarter topics. Note that since the area measure 
is computed over the individual system’s worst X topics, different systems’ scores are computed over a different set 
of topics in general. 
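
The two measures just described can be sketched in a few lines of Python. This is a minimal illustration, not the track's evaluation code: the function names are hypothetical, and the normalization of the area (here the mean height of the MAP(X) curve over the worst quarter of topics) is an assumption, since the text does not state how the area is scaled.

```python
def percent_no(relevant_in_top10):
    """%no: percentage of topics with no relevant document in the top ten.

    relevant_in_top10 holds, per topic, the number of relevant documents
    found among the first ten retrieved.
    """
    failures = sum(1 for count in relevant_in_top10 if count == 0)
    return 100.0 * failures / len(relevant_in_top10)


def map_x_curve(ap_scores):
    """MAP(X) for X = 1..(number of topics // 4), using the run's own worst X topics."""
    worst_first = sorted(ap_scores)
    quarter = max(1, len(worst_first) // 4)
    curve = []
    running_sum = 0.0
    for x in range(1, quarter + 1):
        running_sum += worst_first[x - 1]
        curve.append(running_sum / x)
    return curve


def area_measure(ap_scores):
    """Area under MAP(X) vs. X, normalized by the curve width (assumed scaling)."""
    curve = map_x_curve(ap_scores)
    return sum(curve) / len(curve)
```

Because the topics are sorted per run, two runs are generally scored over different topics, which is the caveat noted above.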


Table 2: Groups participating in the robust track. 


Chinese Academy of Sciences (CAS-NLPR) Fondazione Ugo Bordoni 

Hong Kong Polytechnic University Hummingbird 

IBM Research, Haifa Indiana University 

Johns Hopkins University/APL Max-Planck Institute for Computer Science 


Peking University Queens College, CUNY 
Sabir Research, Inc. University of Glasgow 
University of Illinois at Chicago Virginia Tech 





2 Retrieval Results 


The robust track received a total of 110 runs from the 14 groups listed in Table 2. All of the runs submitted to the track 
were automatic runs (most likely because there were 250 topics in the test set). Participants were allowed to submit
up to 10 runs. To have comparable runs across participating sites, one run was required to use just the description field 
of the topic statements, one run was required to use just the title field of the topic statements, and the remaining runs 
could use any combination of fields. There were 31 title-only runs and 32 description-only runs submitted to the track. 
There was a noticeable difference in effectiveness depending on the portion of the topic statement used: runs using 
both the title and description fields were better than using either field in isolation. 

Table 3 gives the evaluation scores for the best run for the top 10 groups who submitted either a title-only run or a 
description-only run. The table gives the scores for the four main measures used in the track as computed over the old 
topics only, the new topics only, the difficult topics, and all 249 topics. The four measures are mean average precision 
(MAP), the average of precision at 10 documents retrieved (P10), the percentage of topics with no relevant in the top 
10 retrieved (%no), and the area underneath the MAP(X) vs. X curve (area). The run shown in the table is the run 
with the highest MAP score as computed over the combined topic set; the table is sorted by this same value. 


2.1 Retrieval methods 


All of the top-performing runs used the web to expand queries [5, 6, 1]. In particular, Kwok and his colleagues had 
the most effective runs in both TREC 2003 and 2004 by treating the web as a large, domain-independent thesaurus 
and supplementing the topic statement by its terms [5]. When performed carefully, query expansion by terms in a 
collection other than the target collection can increase the effectiveness of many topics, including poorly performing 
topics. Expansion based on the target collection does not help the poor performers because pseudo-relevance feedback 
needs some relevant documents in the top retrieved to be effective, and that is precisely what the poorly performing 
topics don’t have. The web is not a panacea, however, in that some approaches to exploiting the web can be more 
harmful than helpful [14]. 

Other approaches to improving the effectiveness of poor performers included selecting a query processing strategy 
based on a prediction of topic effectiveness [15, 8], and reordering the original ranking in a post-retrieval phase [7, 13].
Weighting functions, topic fields, and query expansion parameters were selected depending upon the prediction of 
topic difficulty. Documents were reordered based on trying to ensure different aspects of the topic were all represented. 
While each of these techniques can help some topics, the improvement was not as consistent as expanding by an 
external corpus. 


2.2 Difficult topics 


One obvious aspect of the results is that the hard topics remain hard. Evaluation scores when computed over just the
hard topics are approximately half as good as they are when computed over all topics for all measures except P(10),
which doesn’t degrade quite as badly. While the robust track results don’t say anything about why these topics are
hard, the 2003 NRRC RIA workshop [4] performed failure analysis on 45 topics from the 301-450 topic set. As one
of the results of the failure analysis, Buckley assigned each of the 45 topics to one of 10 failure categories [2]. He ordered
the categories by the amount of natural language understanding (NLU) he thought would be required to get good 


Table 3: Evaluation results for the best title-only run (a), and best description-only run (b) for the top 10 groups as 
measured by MAP over the combined topic set. Runs are ordered by MAP over the combined topic set. Values given 
are the mean average precision (MAP), precision at rank 10 averaged over topics (P10), the percentage of topics with 
no relevant in the top ten retrieved (%no), and the area underneath the MAP(X) vs. X curve (area) as computed for 
the set of 200 old topics, the set of 49 new topics, the set of 50 hard topics, and the combined set of 249 topics. 


Old Topic Set New Topic Set Hard Topic Set Combined Topic Set 
pircRB04t3 
fub04Tge 
uic0401 
uogRobSWR10 
vtumtitle 
humR04t5el 
JuruTitSwQE 
SABIR04BT
apl04rsTs 
polyutp3 

(a) title-only runs 


pircRB04d4 
fub04Dge 
uogRobDWR10 
vtumdesc 
humR04d4e5 
JuruDesQE 
SABIR04BD
wdoqdn1 
apl04rsDw 
polyudp2 





(b) description-only runs 


effectiveness for the topics in that category, and suggested that topics in categories 1-5 should be amenable to today’s 
technology if systems could detect what category the topic was in. More than half of the 45 topics studied during RIA 
were placed in the first 5 categories. 

Twenty-six topics are in the intersection of the robust track’s hard set and the RIA failure analysis set. Table 4 
shows how the topics in the intersection were categorized by Buckley. Seventeen of the 26 topics in the intersection 
are in the earlier categories, suggesting that the hard topic set should not be a hopelessly difficult topic set. 


3 Predicting difficulty 


A necessary first step in determining the problem with a topic is the ability to recognize whether or not it will be 
effective. Obviously, to be useful the system needs to be able to make this determination at run time and without 
any explicit relevance information. Cronen-Townsend, Zhou, and Croft suggested the clarity measure, the relative 
entropy between a query language model and the corresponding collection language model, as one way of predicting 
the effectiveness of a query [3]. The robust track required systems to rank the topics in the test set by predicted 
difficulty to explore how capable systems are at recognizing difficult topics. A similar investigation in the TREC 
2002 question answering track demonstrated that accurately predicting whether a correct answer was retrieved is a 
challenging problem [10]. 
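
As a rough illustration of the clarity measure mentioned above, the relative entropy between two language models can be computed as follows. This sketch assumes both models are already available as word-to-probability dictionaries; how the query model is estimated is described in [3] and is not reproduced here. The function name, the base-2 logarithm, and the requirement that the collection model give nonzero probability to every query-model word are assumptions of the sketch.

```python
import math


def clarity(query_model, collection_model):
    """Relative entropy (in bits) of the query model with respect to the collection model.

    Both arguments map words to probabilities. Words with zero probability
    under the query model contribute nothing to the sum.
    """
    score = 0.0
    for word, p_query in query_model.items():
        if p_query > 0.0:
            score += p_query * math.log2(p_query / collection_model[word])
    return score
```

A query model identical to the collection model scores 0.0, while a model concentrated on a few words scores higher.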

In addition to including the retrieval results for each topic, a robust track run ranked the topics in strict order from 
1 to 250 such that the topic at rank 1 was the topic the system predicted it had done best on, the topic at rank 2 
was the topic the system predicted it had done next best on, etc. This ranking was the predicted ranking. Once the 
evaluation was complete, the topics were ranked from best to worst by average precision score; this ranking was the 


Table 4: Failure categories of hard topics. 


Category | Category gloss | Topics

2 | general technical failures such as stemming | 353, 378
3 | systems all emphasize one aspect, miss another required term | [illegible], 419, 445
4 | systems all emphasize one aspect, miss another aspect | 350, 355, 372, 408, 409, 435, 443
5 | some systems emphasize one aspect, some another, need both | 307, 310, 330, 363, 436
6 | systems all emphasize some irrelevant aspect, missing [illegible] | 347
7 | need outside expansion of “general” term (e.g., expand Europe to individual countries) | 401, 443, 448
8 | need query analysis to determine relationship between query terms | 414
9 | systems missed difficult aspect | 362, 367, 389, 393, 401, 404





actual ranking. 

One measure of how well two rankings agree is Kendall’s τ [9]. Kendall’s τ measures the similarity between
two rankings as a function of the number of pairwise swaps needed to turn one ranking into the other. The τ ranges
between -1.0 and 1.0, where the expected correlation between two randomly generated rankings is 0.0 and a τ of 1.0
indicates perfect agreement. The run with the largest τ between the predicted and actual ranking was the uic0401
run, with a τ of 0.623. Fourteen of the 110 runs submitted to the track had a negative correlation between the predicted
and actual rankings. (The topic that was dropped from the evaluation was also removed from the rankings before the
τ was computed.)
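
For concreteness, Kendall’s τ between two strict rankings of the same topics can be computed by counting concordant and discordant pairs. The sketch below is a straightforward quadratic-time version with a hypothetical function name; it assumes no ties and at least two topics, which matches the strict rankings required by the track.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two strict rankings of the same items (best first)."""
    position_b = {topic: index for index, topic in enumerate(ranking_b)}
    concordant = 0
    discordant = 0
    # combinations() yields pairs in ranking_a order, so `first` precedes `second` there.
    for first, second in combinations(ranking_a, 2):
        if position_b[first] < position_b[second]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

Identical rankings give 1.0 and exactly reversed rankings give -1.0.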

The Kendall’s τ score between the predicted and actual ranking for a run is given as part of the run’s description in
the Appendix of these proceedings. Unfortunately, Kendall’s τ between the entire predicted and actual rankings is not
a very good measure of whether a system can recognize poorly performing topics. The main problem is that Kendall’s
τ is sensitive to any difference in the rankings (by design). But for the purposes of predicting when a topic will be a
poor performer, small differences in average precision don’t matter, nor does the actual ranking of the very effective 
topics. 

A more accurate representation of how well systems predict poorly performing topics is to look at how MAP scores 
change when successively greater numbers of topics are eliminated from the evaluation. The idea is essentially the 
inverse of the area measure: instead of computing MAP over the X worst topics, compute it over the best Y topics 
where Y = 249...199 and the best topics are defined as the first Y topics in either the predicted or actual ranking. 
The difference between the two curves produced using the actual ranking on the one hand and the predicted ranking on 
the other is the measure of how accurate the predictions are. Figure 1 shows these curves plotted for the uic0401 run, 
the run with the highest Kendall correlation, on the left and the humR04d5 run, the run with the second-smallest (footnote 1)
difference between curves, on the right. In the figure, the MAP scores computed when eliminating topics from the 
actual ranking are plotted with circles and scores using the predicted ranking are plotted with triangles. 
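
The comparison of the two curves can be sketched as follows. The function names are hypothetical, and the way the area is accumulated (a plain sum of the differences between the two MAP values at each cutoff) is an assumption, since the text does not give the exact formula. With 249 topics and max_dropped=50, the cutoffs correspond to Y = 249...199.

```python
def map_over_best(ranking, ap_by_topic, keep):
    """MAP over the first `keep` topics of a ranking (best first)."""
    kept = ranking[:keep]
    return sum(ap_by_topic[topic] for topic in kept) / keep


def area_between_map_curves(predicted, ap_by_topic, max_dropped=50):
    """Sum of MAP differences between the actual and predicted rankings.

    predicted is the system's ranking of topics, best first.
    ap_by_topic maps each topic to its average precision score.
    """
    actual = sorted(ap_by_topic, key=ap_by_topic.get, reverse=True)
    total_topics = len(actual)
    if max_dropped >= total_topics:
        raise ValueError("max_dropped must be smaller than the number of topics")
    area = 0.0
    for dropped in range(max_dropped + 1):
        keep = total_topics - dropped
        area += (map_over_best(actual, ap_by_topic, keep)
                 - map_over_best(predicted, ap_by_topic, keep))
    return area
```

Since the actual ranking always drops the truly worst topics first, each difference is non-negative, and a perfect prediction yields an area of zero.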

Figure 2 shows a scatter plot of the area between the MAP curves versus the Kendall τ between the rankings for
each of the 110 runs submitted to the track. If the τ and area-between-MAP-curves agreed as to which runs made
good predictions, the points would lie on a line from the upper left to the lower right. While the general tendency is
roughly in that direction, there are enough outliers to argue against using Kendall’s τ over the entire topic ranking for
this purpose. 

Figure 2 also shows that there is quite a range in systems’ abilities to predict which topics will be poor performers 
for them. Twenty-two of the 110 runs representing 5 of the 14 groups had area-between-MAP-curves scores of 0.5 
or less. Thirty runs representing six groups (all distinct from the first group) had area-between-MAP-curves scores 


Footnote 1: The run with the smallest difference was an ineffective run where almost all topics had very small average precision scores.
Figure 1: Effect of differences in actual and predicted rankings on MAP scores.








[Figure 2 plot not reproduced: each point is one run; x-axis: area between MAP curves; y-axis: Kendall’s τ.]


Figure 2: Scatter plot of area-between-MAP-curves vs. Kendall’s τ for robust track runs.


of greater than 1.0. How much accuracy is required, and indeed whether accurate predictions can be exploited at all,
remains to be seen.


4 Evaluating Ineffectiveness 


Most TREC topic sets contain 50 topics. In the TREC 2003 robust track we showed that the %no and area measures 
that emphasize poorly performing topics are unstable when used with topic sets as small as 50 topics. The problem is 
that the measures are defined over a subset of the topics in the set causing them to be much less stable than traditional 
measures for a given topic set size. In turn, the instability causes the margin of error associated with the measures to 


Table 5: Error rate and proportion of ties for different measures and topic set sizes. 


          50 Topics            75 Topics            100 Topics           124 Topics

          Error     Proportion Error     Proportion Error     Proportion Error     Proportion
          Rate (%)  of Ties    Rate (%)  of Ties    Rate (%)  of Ties    Rate (%)  of Ties





be large relative to the difference in scores observed in practice. 


4.1 Stability of %no and area measure 


The motivation for using 250 topics in this year’s track was to test the stability of the measures on larger topic set
sizes. The empirical procedures to compute the error rates and error margins are the same as were used in the 2003 
track [11] except the topic set size is varied. Since the combined topic set contained 249 topics, topic set sizes up to 
124 (half 249) can be tested. 

Table 5 shows the error rate and proportion of ties computed for the four different measures used in table 3 and 
four different topic set sizes: 50, 75, 100, and 124. The error rate shows how likely it is that a single comparison of two 
systems using the given topic set size and evaluation measure will rank the systems in the wrong order. For example, 
an error rate of 3% says that in 3 out of 100 cases the comparison will be wrong. Larger error rates imply a less stable 
measure. The proportion of ties indicates how much discrimination power a measure has; a measure with a low error 
rate but a high proportion of ties has little power. 
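
One common way to estimate these two quantities empirically is to compare every pair of runs over many randomly drawn topic sets and count how often the pair's ordering flips. The sketch below follows that idea using MAP as the measure. It is a simplified illustration under stated assumptions, not the procedure of [11]: the function names, the random sampling scheme, and the use of the less frequent outcome as the "wrong" order are all assumptions. The 5% tie rule matches the fuzziness value used for Table 5.

```python
import random
from itertools import combinations


def compare(score_a, score_b, fuzz=0.05):
    """Return 1 if a wins, -1 if b wins, 0 if the scores are within `fuzz` of the larger."""
    larger = max(score_a, score_b)
    if score_a == score_b or abs(score_a - score_b) < fuzz * larger:
        return 0
    return 1 if score_a > score_b else -1


def error_rate_and_ties(ap_by_run, set_size, trials=1000, fuzz=0.05, seed=0):
    """Estimate (error rate, proportion of ties) for MAP at a given topic set size.

    ap_by_run maps a run name to a dict of topic -> average precision;
    all runs are assumed to cover the same topics.
    """
    rng = random.Random(seed)
    topics = sorted(next(iter(ap_by_run.values())))
    topic_sets = [rng.sample(topics, set_size) for _ in range(trials)]
    errors = ties = total = 0
    for run_a, run_b in combinations(sorted(ap_by_run), 2):
        a_wins = b_wins = pair_ties = 0
        for subset in topic_sets:
            map_a = sum(ap_by_run[run_a][t] for t in subset) / set_size
            map_b = sum(ap_by_run[run_b][t] for t in subset) / set_size
            outcome = compare(map_a, map_b, fuzz)
            if outcome > 0:
                a_wins += 1
            elif outcome < 0:
                b_wins += 1
            else:
                pair_ties += 1
        # The less frequent ordering is treated as the erroneous one.
        errors += min(a_wins, b_wins)
        ties += pair_ties
        total += a_wins + b_wins + pair_ties
    return errors / total, ties / total
```

Raising `fuzz` turns more comparisons into ties, which lowers the error rate at the cost of discrimination power.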

The error rates computed for topic set size 50 are somewhat higher than those computed for the TREC 2003 track, 
probably reflecting the greater variety of topics the error rate was computed from. The general trends in the error 
rates are strong and consistent: error rate decreases as topic set size increases, and the %no and area measures have a 
significantly higher error rate than MAP or P(10) at equal topic set sizes. 

Using the standard of no larger than a 5% error rate, the area measure can be used with test sets of at least 124 
topics, while the %no measure requires still larger topic sets. Note that since the area measure is defined using the
worst quarter topics, a 124 topic set size implies the measure is using 31 topics in its computation. While this is good 
for stability, it is no longer as focused on the very poor topics. 

The error rates shown in Table 5 assumed two runs whose difference in score was less than 5% of the larger score
were equally effective. By using a larger value for the difference before deciding two runs are different, we can
decrease the error rate for a given topic set size (because the discrimination power is reduced) [12]. Table 6 gives
the critical value required to obtain no more than a 5% error rate for a given topic set size. For the area measure, the
critical value is the minimum difference in area scores needed. For the %no measure, the critical value is the number of
additional topics that must have no relevant in the top ten, also expressed as a percentage of the total topic set size.
Also given in the table is the percentage of the comparisons that exceeded the critical value when comparing all pairs 
of runs submitted to the track over all 1000 topic sets used to estimate the error rates. This percentage demonstrates 
how sensitive the measure is to score differences encountered in practice. 

The sensitivity of the %no measure does increase with topic set size, but the sensitivity is still very poor even at 
124 topics. While intuitively appealing, this measure is just too coarse to be useful unless there are massive numbers 
of topics. Note that the same argument applies to the “Success @ 10” measure (i.e., the number of topics that retrieve 
a relevant document in the top 10 retrieved) that is being used to evaluate tasks such as home page finding and the 
document retrieval phase of question answering. 

The sensitivity of the area measure is more reasonable. The area measure appears to be an acceptable measure for 
topic set sizes of at least 100 topics, though as mentioned above, its emphasis on the worst performing topics lessens 
as topic set size grows.


Table 6: Sensitivity of measures: given is the critical value required to have an error rate no greater than 5% plus the 
percentage of comparisons over track run pairs that exceeded the critical value. 


        50 Topics               75 Topics               100 Topics              124 Topics

        Critical   %            Critical   %            Critical   %            Critical   %
        Value      Significant  Value      Significant  Value      Significant  Value      Significant

%no     11 (22%)   3.8          16 (21%)   3.9          11 (10%)   15.7         13 (10%)   16.3
area    0.025      16.5         0.020      38.6         0.015      62.4         0.015      68.8





Table 7: Evaluation scores for the runs of Figure 3. 


               MAP     geometric MAP   P10     area    %no

pircRB04td2    0.359   0.263           0.541   0.047    4
NLPR04clus10   0.306   0.230           0.449   0.048    8
uogRobLWR10    0.320   0.176           0.448   0.015   11


4.2 Geometric MAP 


The problem with using MAP as a measure for poorly performing topics is that changes in the scores of better- 
performing topics mask changes in the scores of poorly performing topics. For example, the MAP of a run in which 
the effectiveness of topic A doubles from 0.02 to 0.04 while the effectiveness of topic B decreases 5% from 0.4 to 
0.38 is identical to the baseline run’s MAP. This suggests using a nonlinear rescaling of the individual topics’ average 
precision scores before averaging over the topic set as a way of emphasizing the poorly performing topics. 

The geometric mean of the individual topics’ average precision scores has the desired effect of emphasizing scores 
close to 0.0 (the poor performers) while minimizing differences between larger scores. The geometric mean is equivalent
to taking the log of the individual topics’ average precision scores, computing the arithmetic mean of the logs,
and exponentiating back for the final geometric MAP score. Since the average precision score for a single topic can
be 0.0 (and trec_eval reports scores to 4 significant digits), we take the expedient of adding 0.00001 to all scores
before taking the log (and then subtracting 0.00001 from the result after exponentiating).
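
The computation described above is short enough to write out. The following is a minimal sketch with a hypothetical function name; the constant 0.00001 is the one given in the text, and natural logarithms are used (any base gives the same result as long as it is inverted consistently).

```python
import math

EPSILON = 0.00001  # added before the log and subtracted after exponentiating


def geometric_map(ap_scores, eps=EPSILON):
    """Geometric mean of per-topic average precision scores, guarded against 0.0."""
    if not ap_scores:
        raise ValueError("at least one topic score is required")
    mean_log = sum(math.log(score + eps) for score in ap_scores) / len(ap_scores)
    return math.exp(mean_log) - eps


# The two-topic example from the text: topic A doubles, topic B drops by 5%.
baseline = [0.02, 0.40]
changed = [0.04, 0.38]
print(sum(baseline) / 2, sum(changed) / 2)               # arithmetic means (equal up to rounding)
print(geometric_map(baseline), geometric_map(changed))   # geometric means differ
```

The arithmetic means are both 0.21, while the geometric mean rises from roughly 0.089 to roughly 0.123, so the improvement on the poorly performing topic is no longer masked.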

To understand the effect of the various measures, Figure 3 shows a plot of the individual topic average precision 
scores for three runs from the TREC 2004 robust track. For each run, the average precision scores are sorted by 
increasing score and plotted in that order. Thus the x-axis in the figure represents a topic rank and the y-axis is the 
average precision score obtained by the topic at that rank. The three runs were selected to illustrate the differences 
in the measures. The pircRB04td2 run was the most effective run as measured by both standard MAP over all
249 topics and geometric MAP over all 249 topics. The NLPR04clus10 run has relatively few abysmal topics and
also relatively few excellent topics, while the uogRobLWR10 run has relatively many of both abysmal and excellent
topics. The evaluation scores for these three runs are given in Table 7. The uogRobLWR10 run has a better standard
MAP score than the NLPR04clus10 run, and a worse area and geometric MAP score. The P(10) scores for the two
runs are essentially identical.

Table 8 shows that the geometric mean measure is also a stable measure. The table gives the error rate and 
proportion of ties for geometric MAP for various topic set sizes. As in Table 5, the geometric MAP’s error rates are 
computed assuming a difference in scores less than 5% of the larger score is a tie. Compared to the error rates for the 
measures given in Table 5, geometric MAP’s error rate is larger than both standard MAP and P(10) for equal topic 
set sizes, but much reduced compared to the area and %no measures. The geometric MAP measure has the additional 
benefit over the area measure of being less complex. Given just the geometric MAP scores for a run over two sets of 
topics, the geometric MAP score for that run on the combined set of topics can be computed, which is not the case for 
the area measure. 
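
The combination property can be made concrete with a short sketch. It assumes the two topic sets are disjoint, that their sizes are known in addition to the scores, and that the same 0.00001 adjustment described in Section 4.2 was used for both; the function names are hypothetical.

```python
import math

EPSILON = 0.00001


def geometric_map(ap_scores, eps=EPSILON):
    """Geometric MAP with the small additive adjustment for zero scores."""
    mean_log = sum(math.log(score + eps) for score in ap_scores) / len(ap_scores)
    return math.exp(mean_log) - eps


def combine_geometric_map(gmap_a, size_a, gmap_b, size_b, eps=EPSILON):
    """Geometric MAP over the union of two disjoint topic sets.

    Each set's mean log score is recovered as log(gmap + eps); the combined
    value is the size-weighted mean of those logs, exponentiated back.
    """
    log_sum = size_a * math.log(gmap_a + eps) + size_b * math.log(gmap_b + eps)
    return math.exp(log_sum / (size_a + size_b)) - eps
```

For example, combining the scores for the old and new topic sets in this way reproduces the score over the combined set without access to the per-topic values. No comparable shortcut exists for the area measure, because its worst-quarter topics must be reselected from the per-topic scores.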
Figure 3: Individual topic average precision scores for three TREC 2004 runs.


Table 8: Error rate and proportion of ties computed over different topic set sizes for the geometric MAP measure. 





5 Conclusion 


The first two years of the TREC robust retrieval track have focused on trying to ensure that all topics obtain minimum 
effectiveness levels. The most promising approach to accomplishing this feat is exploiting text collections other than 
the target collection, usually the web. Believing that you cannot improve that which you cannot measure, the track 
has also examined evaluation measures that emphasize poorly performing topics. The geometric MAP measure is the 
most stable measure with a suitable emphasis. 

The robust retrieval track is scheduled to run again in TREC 2005, though the focus of the track is expected to 
change. The current thinking is that the track will test the robustness of ad hoc retrieval technology by examining how 
stable it is in face of changes to the retrieval environment. To accomplish this, participants in the robust track will 
be asked to use their system for the ad hoc task in at least two of the other TREC tracks (for example, genomics and 
terabyte or terabyte and HARD). Within the robust track, same-system runs will be contrasted to see how differences in 
the tasks affect performance. Runs will also be evaluated using existing robust track measures, particularly geometric 
MAP. 